Update japan_dict.txt #13142

madmalkav · 2024-06-20T09:29:09Z

Update japan_dict.txt to include missing jouyou kanji ( #12940 )

CLAassistant · 2024-06-20T09:29:15Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

GreatV · 2024-06-20T13:28:20Z

Will this affect the previous jp model?

madmalkav · 2024-06-20T13:34:26Z

It is my understanding that this file is only used when training a new model, so I don't think it would have any effect until a new model is trained, but there is a chance I have misunderstood and it works in a different way 😅

GreatV

Since the previous model uses the same character dictionary file, changing it may make that model produce inconsistent output. I think it would be better to use a different filename (e.g., ja_ext_dict.txt).
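
To illustrate the concern, here is a minimal sketch. It assumes a recognition model that emits class indices which are mapped back to characters by line position in the dictionary file, with index 0 reserved for a CTC blank. The tiny dictionaries and the `build_index_map` and `decode` helpers are hypothetical and are not PaddleOCR code.

```python
# Hypothetical illustration: indices from a model trained on one dictionary
# are decoded with a modified dictionary.

def build_index_map(dict_lines):
    """Map class index -> character; index 0 is assumed to be the CTC blank."""
    return {i + 1: ch for i, ch in enumerate(dict_lines)}


def decode(indices, index_map):
    """Greedy CTC-style decode: collapse repeats, then drop blanks (index 0)."""
    chars = []
    prev = None
    for idx in indices:
        if idx != prev and idx != 0:
            chars.append(index_map.get(idx, "?"))
        prev = idx
    return "".join(chars)


old_dict = ["日", "本", "語"]
inserted_dict = ["日", "亜", "本", "語"]  # new kanji inserted mid-file
appended_dict = ["日", "本", "語", "亜"]  # new kanji appended at the end

# Indices a model trained on old_dict might emit for the text 日本語.
model_output = [1, 1, 0, 2, 3]

old_text = decode(model_output, build_index_map(old_dict))
inserted_text = decode(model_output, build_index_map(inserted_dict))
appended_text = decode(model_output, build_index_map(appended_dict))

print(old_text)       # decoded with the dictionary the model was trained on
print(inserted_text)  # every index after the insertion point now maps to a different character
print(appended_text)  # existing indices keep their meaning
```

In this sketch, inserting a character mid-file changes what the old indices mean, while appending leaves them intact. Even an appended dictionary may still disagree with the number of classes the old model was trained with, so this does not show that appending is safe.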

madmalkav · 2024-06-20T14:04:58Z

If this file can indeed affect the current model, I would propose holding this PR until a new version of the Japanese model is about to be trained. I don't much like the idea of creating a new file and calling it "extended", because this is not really extending the dictionary; it is fixing a fault in it.

GreatV · 2024-06-20T14:17:51Z

@madmalkav, That makes sense.

madmalkav · 2024-08-18T10:58:43Z

Can I ask why this similar PR for another language was already merged? What is the difference from my PR? I would like to understand so I can see whether there is anything else I can do to move this forward.

GreatV · 2024-08-18T13:58:11Z

I don't think that PR, now that it has been merged, will work properly with the previous model.

GreatV · 2024-08-19T01:24:58Z

Maybe we could keep version information for the character dictionary, for example with a new column in the model list showing which dictionary each model uses. New character dictionaries could then be merged in, while old models keep using the old character dictionaries.

| model code | description | character dictionary | model size | download | update date |
| --- | --- | --- | --- | --- | --- |
| ch | Chinese and English | https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/ppocr/utils/ppocr_keys_v1.txt | 3.71M | inference model / trained model | 2020.9.22 |
| ch | Chinese and English | https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.8/ppocr/utils/ppocr_keys_v1.txt | 3M | inference model / trained model | 2024.9.22 |
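
The idea could be sketched as a small registry that pins each model release to the dictionary revision it was trained with. This is only an illustration: `ModelEntry`, `MODEL_LIST`, and `dictionary_for` are hypothetical names, and the entries mirror the example rows in the table rather than real release data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    """One row of a hypothetical model list with a pinned dictionary."""
    code: str
    description: str
    dictionary_url: str
    update_date: str  # same "YYYY.M.D" form as the table


BASE = "https://github.com/PaddlePaddle/PaddleOCR/blob"

# Example rows copied from the table; not real release data.
MODEL_LIST = [
    ModelEntry("ch", "Chinese and English",
               f"{BASE}/release/2.7/ppocr/utils/ppocr_keys_v1.txt", "2020.9.22"),
    ModelEntry("ch", "Chinese and English",
               f"{BASE}/release/2.8/ppocr/utils/ppocr_keys_v1.txt", "2024.9.22"),
]


def dictionary_for(code, update_date):
    """Return the dictionary URL pinned to a specific model release."""
    for entry in MODEL_LIST:
        if entry.code == code and entry.update_date == update_date:
            return entry.dictionary_url
    raise KeyError(f"no model {code!r} with update date {update_date!r}")


print(dictionary_for("ch", "2020.9.22"))
print(dictionary_for("ch", "2024.9.22"))
```

With a lookup like this, an older model would keep resolving to the dictionary it was trained on, even after a newer dictionary revision is merged.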

Update japan_dict.txt

c9fa60a

Update japan_dict.txt to include missing jouyou kanji

GreatV reviewed Jun 20, 2024

GreatV requested a review from jzhang533 June 20, 2024 14:02

GreatV added the "language requests" and "contributor" labels Jun 23, 2024

uyeongjae mentioned this pull request Jun 27, 2024

The Korean language model fails to recognize '.' because it is missing. #13147

Open
